Topic Modelling Experiments on Hellenistic Corpora
Authors
Abstract
The focus of this study is Hellenistic Greek, a variety of Greek that continues to be of particular interest within the humanities. We argue that Hellenistic Greek requires tools specifically tuned to its orthographic and semantic idiosyncrasies. This paper puts the available documents to use in two ways: 1) by describing the development of a POS tagger and a lemmatizer trained on annotated texts written in Hellenistic Greek, and 2) by representing the lemmatized output as topic models in order to examine the effects of a) automatically processing the texts, and b) semi-automatically correcting the lemmatizer's output for tokens that occur frequently in Hellenistic Greek corpora. In addition to topic models, we also generate and compare lists of semantically related words.
Similar resources
Building and Modelling Multilingual Subjective Corpora
Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such corpora are rare. In this work we treat opinions as either subjective or objective. In this paper, we introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that annotates sent...
Topic Stability over Noisy Sources
Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the...
Evaluating a Topic Modelling Approach to Measuring Corpus Similarity
Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been ...
Adaptive topic-dependent language modelling using word-based varigrams
This paper presents two extensions of the standard interpolated word trigram and cache model, namely the extension of the trigram model by useful word m-grams with m > 3, resulting in a varigram model, and the addition of topic-specific trigram models. We give the criteria for selecting useful m-grams and for partitioning the training corpus into topic-specific subcorpora. We apply both extens...
Lau, Jey Han, David Newman and Timothy Baldwin (to appear) On Collocations and Topic Models, ACM Transactions on Speech and Language Processing
We investigate the impact of pre-extracting and tokenising bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and ...
Publication date: 2017